Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off
Abstract
Dictionaries constructed using distributional models of lexical semantics have a wide range of applications in NLP and in the modeling of linguistic cognition. However, when constructing such a model, we are faced with a range of corpora to choose from. Often there is a choice between small, carefully constructed corpora of well-edited text and very large heterogeneous collections harvested automatically from the web. There may also be differences in the distribution of genres and registers in such corpora. In this paper we examine these trade-offs by constructing a simple SVD-reduced word-collocate model using four English corpora: the Google Web 5-gram collection, the Google Book 5-gram collection, the English Wikipedia, and a collection of short social messages harvested from Twitter. Since these models need to encode semantics in a way that approximates the mental lexicon, we evaluate the felicity of the resulting semantic representations using a set of behavioral and neural-activity benchmarks that depend on word similarity. We find that the quality of the input text has a very strong effect on the performance of the output model, and that a small corpus of high quality can outperform a corpus of poor quality that is many orders of magnitude larger. We also explore the semantic closeness of the models using their mutual information overlap to interpret the similarity of corpus texts.
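A minimal sketch of what an SVD-reduced word-collocate model can look like is given below. The abstract does not state the weighting scheme, context window, or number of dimensions, so the positive PMI weighting, the window of 2, the 3 retained dimensions, the toy sentences, and all function names here are illustrative assumptions rather than the authors' setup.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word appears within `window` tokens of another."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0
    return vocab, counts

def ppmi(counts):
    """Positive pointwise mutual information weighting (an assumed choice)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf; treat as no association
    return np.maximum(pmi, 0.0)

def svd_reduce(matrix, k):
    """Keep the top-k singular directions as dense word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy corpus (hypothetical); a real model would be built from corpus-scale counts.
sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
    "the dog ate the bone".split(),
]
vocab, counts = cooccurrence_matrix(sentences, window=2)
vectors = svd_reduce(ppmi(counts), k=3)
idx = {w: i for i, w in enumerate(vocab)}
print("cat~dog:", cosine(vectors[idx["cat"]], vectors[idx["dog"]]))
print("cat~cheese:", cosine(vectors[idx["cat"]], vectors[idx["cheese"]]))
```

Word-similarity benchmarks of the kind the abstract mentions would then compare such cosine scores against human or neural measurements. On a corpus this small the printed values carry no real meaning.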
Related resources
Compositionally Derived Representations of Morphologically Complex Words in Distributional Semantics
Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex word...
Revisiting Word Embedding for Contrasting Meaning
Contrasting meaning is a basic aspect of semantics. Recent word-embedding models based on distributional semantics hypothesis are known to be weak for modeling lexical contrast. We present in this paper the embedding models that achieve an F-score of 92% on the widely-used, publicly available dataset, the GRE “most contrasting word” questions (Mohammad et al., 2008). This is the highest perform...
Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information
Distributional, corpus-based descriptions have frequently been applied to model aspects of word meaning. However, distributional models that use corpus data as their basis have one well-known disadvantage: Even though the distributional features based on corpus co-occurrence were often successful in capturing meaning aspects of the words to be described, they generally fail to capture those mea...
From distributional to semantic similarity
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources...
A Compositional Distributional Semantics, Two Concrete Constructions, and Some Experimental Evaluations
We provide an overview of the hybrid compositional distributional model of meaning, developed in [6], which is based on the categorical methods also applied to the analysis of information flow in quantum protocols. The mathematical setting stipulates that the meaning of a sentence is a linear function of the tensor products of the meanings of its words. We provide concrete constructions for thi...
Publication date: 2013